[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes#1878
[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes#1878jkbradley wants to merge 15 commits intoapache:masterfrom
Conversation
* Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API) Added sc.stop() to all examples. CorrelationSuite.scala * Added 1 test for RDDs with only 1 value Python SparseVector (pyspark/mllib/linalg.py) * Added toDense() function python/run-tests script * Added stat.py (doc test)
* Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
* numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
|
Q: Is the Python SparseVector.toDense() function too big an API update? |
|
QA tests have started for PR 1878. This patch merges cleanly. |
|
Jenkins, retest this please. |
|
QA tests have started for PR 1878. This patch merges cleanly. |
There was a problem hiding this comment.
I don't quite understand why putting random data generation and sampling in a single example file. We can demo generating random uniform/normal/guassian/poisson RDDs in one example, and then stratified sampling in another (e.g., sampling based on the label to re-balance positive/negative examples).
There was a problem hiding this comment.
Sure, I can separate them. I'll call them random_rdds.py and sampled_rdds.py
Split RandomAndSampledRDDs into RandomRDDGeneration and SampledRDDs. (The name RandomRDDGeneration is to avoid a naming conflict with RandomRDDs.) RandomRDDGeneration prints first 5 samples Did same split for Python: random_rdd_generation.py and sampled_rdds.py Other small updates based on code review.
|
@mengxr Thanks for the comments! Updated accordingly. |
|
QA tests have started for PR 1878 at commit
|
|
QA tests have finished for PR 1878 at commit
|
|
QA tests have started for PR 1878 at commit
|
|
QA tests have finished for PR 1878 at commit
|
|
@mengxr It looks like the failures are in other tests; how best to proceed? With respect to the case class Params, is it OK to have them public since they are in examples? (Other examples have them public too.) |
|
Jenkins, test this please. |
|
QA tests have started for PR 1878 at commit
|
There was a problem hiding this comment.
There is a chance that keyCountsB doesn't contains key. It is safer to use keyCountsB.getOrElse here.
|
@jkbradley I tested the examples and found that |
|
QA tests have finished for PR 1878 at commit
|
|
@mengxr Thanks! I'll send the tree fixes in the other PR I sent just now on treeAggregate(), and I will do the keyCount fix in this PR. |
…or division by 0 and for missing key in maps.
|
Jenkins, test this please. |
|
QA tests have started for PR 1878 at commit
|
|
@mengxr Hopefully ready pending Jenkins |
|
QA tests have finished for PR 1878 at commit
|
|
Driver suite test failed...merging with updated master and trying again. |
|
Jenkins, test this please. |
|
QA tests have started for PR 1878 at commit
|
|
QA tests have finished for PR 1878 at commit
|
|
Jenkins, retest this please. |
|
QA tests have started for PR 1878 at commit
|
|
QA tests have finished for PR 1878 at commit
|
|
LGTM. Merged into master and branch-1.1. Thanks! |
Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API) Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey Added sc.stop() to all examples. CorrelationSuite.scala * Added 1 test for RDDs with only 1 value RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message. Python SparseVector (pyspark/mllib/linalg.py) * Added toDense() function python/run-tests script * Added stat.py (doc test) CC: mengxr dorx Main changes were examples to show usage across APIs. Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits: ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps. 8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN. b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan. 32173b7 [Joseph K. Bradley] Stats examples update. c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed statistical_summary.py to correlations.py ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message. 65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey 064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API) (cherry picked from commit c8b16ca) Signed-off-by: Xiangrui Meng <meng@databricks.com>
Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API) Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey Added sc.stop() to all examples. CorrelationSuite.scala * Added 1 test for RDDs with only 1 value RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message. Python SparseVector (pyspark/mllib/linalg.py) * Added toDense() function python/run-tests script * Added stat.py (doc test) CC: mengxr dorx Main changes were examples to show usage across APIs. Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes apache#1878 from jkbradley/mllib-stats-api-check and squashes the following commits: ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps. 8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN. b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan. 32173b7 [Joseph K. Bradley] Stats examples update. c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed statistical_summary.py to correlations.py ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message. 65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check 8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey 064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
Added examples for statistical summarization:
** Tests: correlation, MultivariateOnlineSummarizer
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
Added examples for random and sampled RDDs:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
Added sc.stop() to all examples.
CorrelationSuite.scala
RowMatrix.scala
Python SparseVector (pyspark/mllib/linalg.py)
python/run-tests script
CC: @mengxr @dorx Main changes were examples to show usage across APIs.